Data: rental listings of apartments in New York City
Goal: predict the level of interest in each listing
Some observations and questions raised during data exploration (these would need checking in a real-world use case):
The data is quite homogeneous, being restricted to apartments in New York City. This is akin to building a separate model per local area, which may work well when there is plenty of data.
Is the target normalised to listing periods of the same length (e.g. number of days or weeks)? I suspect not, as there is some correlation between listing ID and interest level.
Use at least two years of historical data to build the model, to account for seasonal effects (e.g. school holidays, vacations).
We should also use out-of-time sampling and testing.
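The out-of-time idea can be sketched as a chronological split: train on older listings and validate on the most recent ones. The function name and the 80/20 fraction are illustrative, not from the original analysis.

```r
# Sketch: out-of-time split, assuming a data frame `d` with a `created`
# timestamp column. Train on earlier listings, validate on the most recent.
out_of_time_split <- function(d, frac_train = 0.8) {
  d <- d[order(d$created), ]            # sort chronologically
  cutoff <- floor(nrow(d) * frac_train)
  list(train = d[seq_len(cutoff), ],    # older listings
       val   = d[-seq_len(cutoff), ])   # most recent listings
}
```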
Target distribution shows imbalanced classes.
table(d$interest_level)
##
## high low medium
## 3839 34284 11229
Interest level is ordinal, so we could also encode the targets as numeric values and build a regression model. Future work.
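The regression encoding flagged above as future work could look like this; the specific mapping (low = 0, medium = 1, high = 2) is an assumption that preserves the ordering.

```r
# Sketch: map the ordinal target to numbers so a regression model can use it.
# The 0/1/2 mapping is illustrative; any monotone mapping would work.
interest_to_numeric <- function(x) {
  unname(c(low = 0, medium = 1, high = 2)[x])
}
```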
sapply(d, class)
## bathrooms bedrooms building_id created
## "numeric" "integer" "character" "character"
## description display_address features latitude
## "character" "character" "list" "numeric"
## listing_id longitude manager_id photos
## "integer" "numeric" "character" "list"
## price street_address interest_level
## "integer" "character" "character"
Train and test span the same three-month period, so there could be information leakage due to seasonal effects. The proportion of listings in each month is the same in both train and test.
min(d$created)
## [1] "2016-04-01 22:12:41"
max(d$created)
## [1] "2016-06-29 21:41:47"
min(d_test$created)
## [1] "2016-04-01 22:23:31"
max(d_test$created)
## [1] "2016-06-29 21:55:35"
d <- mutate(d, month_created = month(as_datetime(created)))
d_test <- mutate(d_test, month_created = month(as_datetime(created)))
table(d$month_created)
##
## 4 5 6
## 16411 16626 16315
table(d_test$month_created)
##
## 4 5 6
## 24879 24987 24793
Listing IDs are just identifiers, so we remove them.
d <- select(d, -listing_id)
There are missing values for latitude and longitude (entered as 0) and outliers in price. We leave them in for now, as tree-based models can handle such cases.
summary(select(d, bathrooms, bedrooms, latitude, longitude, price))
## bathrooms bedrooms latitude longitude
## Min. : 0.000 Min. :0.000 Min. : 0.00 Min. :-118.27
## 1st Qu.: 1.000 1st Qu.:1.000 1st Qu.:40.73 1st Qu.: -73.99
## Median : 1.000 Median :1.000 Median :40.75 Median : -73.98
## Mean : 1.212 Mean :1.542 Mean :40.74 Mean : -73.96
## 3rd Qu.: 1.000 3rd Qu.:2.000 3rd Qu.:40.77 3rd Qu.: -73.95
## Max. :10.000 Max. :8.000 Max. :44.88 Max. : 0.00
## price
## Min. : 43
## 1st Qu.: 2500
## Median : 3150
## Mean : 3830
## 3rd Qu.: 4100
## Max. :4490000
sum(d$latitude == 0)
## [1] 12
sum(d$longitude == 0)
## [1] 12
Intuitively we expect manager, building, and street address all to be predictive of the target. But they are categorical features with many levels: using them directly may lead to a complex model that overfits, so we try to convert them into vectors of real values. However, we must be careful not to leak information from train to validation during cross-validation, especially for managers and buildings with few listings. We use a count threshold to decide when to apply the encoding.
length(unique(d$building_id))
## [1] 7585
length(unique(d$manager_id))
## [1] 3481
length(unique(d$display_address))
## [1] 8630
length(unique(d$street_address))
## [1] 15358
In fact, we see that they all exhibit heavy-tailed distributions: a few top instances have many listings, while the majority have only a few.
inspect_categorical_var(d, 'building_id')
inspect_categorical_var(d, 'manager_id')
inspect_categorical_var(d, 'display_address')
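The leakage-safe conversion described above can be sketched as out-of-fold target encoding with a count threshold. The function and argument names are illustrative, not from the original analysis: statistics for each fold are computed on the other folds only, and categories with fewer than `min_count` listings fall back to the global rate.

```r
# Sketch: leakage-safe target encoding for one high-cardinality column.
oof_target_rate <- function(cat, y, k = 5, min_count = 5) {
  folds <- sample(rep(seq_len(k), length.out = length(cat)))
  out <- numeric(length(cat))
  global_rate <- mean(y)
  for (f in seq_len(k)) {
    idx <- folds == f
    tab_n <- table(cat[!idx])                  # counts outside the fold
    tab_y <- tapply(y[!idx], cat[!idx], mean)  # rates outside the fold
    n <- tab_n[cat[idx]]
    r <- tab_y[cat[idx]]
    # unseen or rare categories get the global rate instead
    use_global <- is.na(n) | n < min_count
    out[idx] <- ifelse(use_global, global_rate, r)
  }
  out
}
```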
We build features for each manager based on the interest levels of their properties. We see that managers vary in their listing performance, for example:
qplot(x = mngr_prop_high, data=manager_features, geom='histogram', binwidth=0.1)
qplot(x = mngr_prop_low, data=manager_features, geom='histogram', binwidth=0.1)
We expect property location to contribute to interest level. This information is contained in latitude and longitude, but we can also leverage the display address to gauge interest at the street level. To aggregate at the street level, however, we need to check that there are enough listings per street.
street_features <- dplyr::count(d, display_address)
summary(street_features$n)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 5.719 3.000 438.000
print('top 10% percentile street')
## [1] "top 10% percentile street"
quantile(street_features$n, 0.9)
## 90%
## 10
So the top 10% of streets have at least 10 listings; we may build features for these streets.
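Building features for these top streets could be sketched as follows; the function and output column names are illustrative, and the threshold of 10 comes from the quantile above.

```r
# Sketch: per-street counts and high-interest rates, keeping only streets
# with at least `min_n` listings.
street_stats <- function(street, interest, min_n = 10) {
  n <- tapply(interest, street, length)
  prop_high <- tapply(interest == "high", street, mean)
  keep <- n >= min_n
  data.frame(display_address  = names(n)[keep],
             n                = as.integer(n[keep]),
             street_prop_high = as.numeric(prop_high[keep]))
}
```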
We zoom in on the centre.
d_zoom <- d %>% filter(longitude > -74.5 & longitude < -73.5 & latitude > 40.5 & latitude < 41)
ggplotly(ggplot(d_zoom, aes(longitude, latitude)) + geom_jitter(aes(color = interest_level), alpha=0.5, size=0.2))
Looking at the density along each dimension, we see a relatively higher concentration of high-interest properties near the edge.
ggplot(d_zoom, aes(x = latitude, color=interest_level)) + geom_density()
ggplot(d_zoom, aes(x = longitude, color=interest_level)) + geom_density()
We can also look at medium and high only, but this carries the same information as the previous plots.
ggplotly(filter(d_zoom, interest_level != 'low') %>% ggplot(aes(longitude, latitude)) + geom_jitter(aes(color = interest_level), alpha=0.5, size=0.5))
We augment the data with manager, street, and building features.
Interest levels are ordinal, so we convert them to 0, 1, 2 and plot correlations. The new features correlate much more strongly with the target than the original features.
d2 <- mutate(d_train, interest = ifelse(interest_level == 'high', 2, ifelse(interest_level == 'medium', 1, 0)))
numeric_features <- names(d2)[which(sapply(d2, is.numeric) | sapply(d2, is.integer))]
cor(select(d2, one_of(numeric_features[1:10]), interest),
method = "spearman", use="pairwise.complete.obs") %>% corrplot(method = "number")
cor(select(d2, one_of(numeric_features[11:20]), interest),
method = "spearman", use="pairwise.complete.obs") %>% corrplot(method = "number")
cor(select(d2, one_of(numeric_features[21:39]), interest),
method = "spearman", use="pairwise.complete.obs") %>% corrplot(method = "number")
## Warning in one_of(numeric_features[21:39]): Unknown variables: `NA`
Let’s build a first-pass model using numerical features only. We can deal with the text features later and ensemble the models.
Since the data is very imbalanced, we should use xgboost with case weights.
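A minimal sketch of such a `case_weights` helper (its definition is not shown in this notebook, and it is commented out in the code below, so this inverse-frequency scheme is an assumption):

```r
# Sketch: inverse-frequency case weights so each class contributes
# roughly equally to the loss. Normalised to a mean weight of 1.
case_weights <- function(y) {
  freq <- table(y) / length(y)   # class frequencies
  w <- 1 / freq[y]               # inverse-frequency weight per row
  as.numeric(w / mean(w))        # normalise to mean weight 1
}
```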
The CV error:
params <- list(eta = 0.2, gamma = 0, max_depth = 6, min_child_weight = 2,
subsample = 1, colsample_bytree = 0.7,
num_class = 3, objective = "multi:softprob", eval_metric = "mlogloss")
xgb_d_train <- xgb.DMatrix(data.matrix(select(d_train, -interest_level)),
label=to_numeric(d_train$interest_level),
#weight=case_weights(d_train$interest_level),
missing=NA)
xgb_d_val <- xgb.DMatrix(as.matrix(select(d_val, -interest_level)),
label=to_numeric(d_val$interest_level), missing=NA)
xgb_cv <- xgb.cv(params, xgb_d_train, nfold = 5, nrounds = 200, early.stop.round = 3, nthread = 4)
## [0] train-mlogloss:0.975658+0.000430 test-mlogloss:0.978276+0.002567
## [1] train-mlogloss:0.891308+0.001711 test-mlogloss:0.896319+0.003286
## [2] train-mlogloss:0.826890+0.002722 test-mlogloss:0.834186+0.004270
## [3] train-mlogloss:0.778011+0.002824 test-mlogloss:0.787518+0.005751
## [4] train-mlogloss:0.740331+0.002795 test-mlogloss:0.751918+0.007003
## [5] train-mlogloss:0.709880+0.003059 test-mlogloss:0.723541+0.008167
## [6] train-mlogloss:0.684465+0.002456 test-mlogloss:0.700254+0.008787
## [7] train-mlogloss:0.665196+0.001879 test-mlogloss:0.682704+0.009882
## [8] train-mlogloss:0.648491+0.002415 test-mlogloss:0.667708+0.010396
## [9] train-mlogloss:0.635054+0.002453 test-mlogloss:0.655909+0.010157
## [10] train-mlogloss:0.623336+0.001712 test-mlogloss:0.645819+0.011107
## [11] train-mlogloss:0.614134+0.001842 test-mlogloss:0.638026+0.011630
## [12] train-mlogloss:0.605720+0.002608 test-mlogloss:0.631306+0.011715
## [13] train-mlogloss:0.598375+0.002192 test-mlogloss:0.625359+0.011649
## [14] train-mlogloss:0.591140+0.002253 test-mlogloss:0.619835+0.012330
## [15] train-mlogloss:0.585345+0.001637 test-mlogloss:0.615354+0.012511
## [16] train-mlogloss:0.579080+0.002636 test-mlogloss:0.610713+0.011939
## [17] train-mlogloss:0.573491+0.002345 test-mlogloss:0.606533+0.012236
## [18] train-mlogloss:0.568807+0.002532 test-mlogloss:0.603161+0.012168
## [19] train-mlogloss:0.564464+0.002699 test-mlogloss:0.600194+0.012240
## [20] train-mlogloss:0.560930+0.002898 test-mlogloss:0.597806+0.012083
## [21] train-mlogloss:0.557446+0.003100 test-mlogloss:0.595803+0.012002
## [22] train-mlogloss:0.554403+0.003009 test-mlogloss:0.593935+0.012145
## [23] train-mlogloss:0.551447+0.003563 test-mlogloss:0.592240+0.011909
## [24] train-mlogloss:0.548464+0.003755 test-mlogloss:0.590671+0.012105
## [25] train-mlogloss:0.545571+0.004182 test-mlogloss:0.589393+0.011777
## [26] train-mlogloss:0.543044+0.004346 test-mlogloss:0.588047+0.011581
## [27] train-mlogloss:0.540234+0.003755 test-mlogloss:0.586630+0.011844
## [28] train-mlogloss:0.537753+0.003585 test-mlogloss:0.585549+0.011968
## [29] train-mlogloss:0.535315+0.003391 test-mlogloss:0.584541+0.012197
## [30] train-mlogloss:0.532663+0.003374 test-mlogloss:0.583288+0.012317
## [31] train-mlogloss:0.530812+0.003740 test-mlogloss:0.582530+0.012134
## [32] train-mlogloss:0.528689+0.003750 test-mlogloss:0.581529+0.012394
## [33] train-mlogloss:0.526245+0.003710 test-mlogloss:0.580697+0.012293
## [34] train-mlogloss:0.524128+0.003750 test-mlogloss:0.579800+0.012347
## [35] train-mlogloss:0.522219+0.003444 test-mlogloss:0.579116+0.012306
## [36] train-mlogloss:0.520660+0.003472 test-mlogloss:0.578578+0.012306
## [37] train-mlogloss:0.519154+0.003623 test-mlogloss:0.578059+0.012303
## [38] train-mlogloss:0.517316+0.003468 test-mlogloss:0.577508+0.012354
## [39] train-mlogloss:0.515643+0.003364 test-mlogloss:0.576867+0.012402
## [40] train-mlogloss:0.514125+0.003555 test-mlogloss:0.576361+0.012395
## [41] train-mlogloss:0.512855+0.003614 test-mlogloss:0.575968+0.012409
## [42] train-mlogloss:0.510892+0.003478 test-mlogloss:0.575567+0.012591
## [43] train-mlogloss:0.509356+0.003455 test-mlogloss:0.575082+0.012532
## [44] train-mlogloss:0.508112+0.003257 test-mlogloss:0.574789+0.012644
## [45] train-mlogloss:0.506126+0.003169 test-mlogloss:0.574399+0.012756
## [46] train-mlogloss:0.504673+0.003423 test-mlogloss:0.573898+0.012839
## [47] train-mlogloss:0.503133+0.003507 test-mlogloss:0.573637+0.012888
## [48] train-mlogloss:0.501522+0.003603 test-mlogloss:0.573297+0.012886
## [49] train-mlogloss:0.499806+0.003924 test-mlogloss:0.572882+0.012953
## [50] train-mlogloss:0.498512+0.003668 test-mlogloss:0.572587+0.013059
## [51] train-mlogloss:0.497459+0.003838 test-mlogloss:0.572391+0.013004
## [52] train-mlogloss:0.495963+0.003772 test-mlogloss:0.571982+0.013247
## [53] train-mlogloss:0.494604+0.004112 test-mlogloss:0.571800+0.013221
## [54] train-mlogloss:0.493227+0.004354 test-mlogloss:0.571670+0.013281
## [55] train-mlogloss:0.491718+0.004465 test-mlogloss:0.571347+0.013369
## [56] train-mlogloss:0.490451+0.004008 test-mlogloss:0.571079+0.013434
## [57] train-mlogloss:0.488950+0.003803 test-mlogloss:0.570800+0.013453
## [58] train-mlogloss:0.487356+0.003738 test-mlogloss:0.570614+0.013497
## [59] train-mlogloss:0.486000+0.003789 test-mlogloss:0.570397+0.013585
## [60] train-mlogloss:0.484655+0.003775 test-mlogloss:0.570149+0.013511
## [61] train-mlogloss:0.483656+0.003801 test-mlogloss:0.569981+0.013575
## [62] train-mlogloss:0.482649+0.003641 test-mlogloss:0.569836+0.013569
## [63] train-mlogloss:0.481344+0.003744 test-mlogloss:0.569783+0.013715
## [64] train-mlogloss:0.479968+0.003893 test-mlogloss:0.569673+0.013719
## [65] train-mlogloss:0.478737+0.003780 test-mlogloss:0.569546+0.013598
## [66] train-mlogloss:0.477278+0.003814 test-mlogloss:0.569390+0.013506
## [67] train-mlogloss:0.475840+0.003847 test-mlogloss:0.569206+0.013539
## [68] train-mlogloss:0.474472+0.003706 test-mlogloss:0.568915+0.013648
## [69] train-mlogloss:0.473138+0.003746 test-mlogloss:0.568809+0.013734
## [70] train-mlogloss:0.472044+0.003487 test-mlogloss:0.568647+0.013801
## [71] train-mlogloss:0.470815+0.003778 test-mlogloss:0.568539+0.013922
## [72] train-mlogloss:0.469652+0.003934 test-mlogloss:0.568433+0.013879
## [73] train-mlogloss:0.468603+0.004175 test-mlogloss:0.568296+0.013916
## [74] train-mlogloss:0.467411+0.004046 test-mlogloss:0.568290+0.014004
## [75] train-mlogloss:0.466149+0.003956 test-mlogloss:0.568114+0.013943
## [76] train-mlogloss:0.465339+0.003658 test-mlogloss:0.568005+0.013991
## [77] train-mlogloss:0.463978+0.003922 test-mlogloss:0.567910+0.014069
## [78] train-mlogloss:0.462658+0.004264 test-mlogloss:0.567847+0.014041
## [79] train-mlogloss:0.461700+0.004212 test-mlogloss:0.567769+0.014106
## [80] train-mlogloss:0.460921+0.004252 test-mlogloss:0.567733+0.014124
## [81] train-mlogloss:0.460093+0.004161 test-mlogloss:0.567685+0.014119
## [82] train-mlogloss:0.459198+0.004137 test-mlogloss:0.567670+0.014064
## [83] train-mlogloss:0.458149+0.004068 test-mlogloss:0.567595+0.014066
## [84] train-mlogloss:0.457133+0.004007 test-mlogloss:0.567546+0.014109
## [85] train-mlogloss:0.456097+0.003927 test-mlogloss:0.567575+0.014126
## [86] train-mlogloss:0.455198+0.003818 test-mlogloss:0.567499+0.014133
## [87] train-mlogloss:0.454256+0.003832 test-mlogloss:0.567522+0.014231
## [88] train-mlogloss:0.453194+0.003850 test-mlogloss:0.567455+0.014160
## [89] train-mlogloss:0.451930+0.003552 test-mlogloss:0.567428+0.014235
## [90] train-mlogloss:0.450709+0.003518 test-mlogloss:0.567446+0.014219
## [91] train-mlogloss:0.449615+0.003564 test-mlogloss:0.567445+0.014346
## [92] train-mlogloss:0.448628+0.003476 test-mlogloss:0.567423+0.014340
## [93] train-mlogloss:0.447457+0.003516 test-mlogloss:0.567432+0.014323
## [94] train-mlogloss:0.446311+0.003508 test-mlogloss:0.567525+0.014353
## [95] train-mlogloss:0.445222+0.003437 test-mlogloss:0.567527+0.014361
## Stopping. Best iteration: 93
xgb_model <- xgb.train(params, xgb_d_train,
nrounds = 500, early.stop.round = 3, nthread = 4,
watchlist = list(val = xgb_d_val))
## [0] val-mlogloss:0.979630
## [1] val-mlogloss:0.898859
## [2] val-mlogloss:0.840962
## [3] val-mlogloss:0.798048
## [4] val-mlogloss:0.761406
## [5] val-mlogloss:0.732785
## [6] val-mlogloss:0.711867
## [7] val-mlogloss:0.693052
## [8] val-mlogloss:0.679435
## [9] val-mlogloss:0.669895
## [10] val-mlogloss:0.658097
## [11] val-mlogloss:0.651483
## [12] val-mlogloss:0.643463
## [13] val-mlogloss:0.635880
## [14] val-mlogloss:0.631997
## [15] val-mlogloss:0.627317
## [16] val-mlogloss:0.624271
## [17] val-mlogloss:0.620170
## [18] val-mlogloss:0.617388
## [19] val-mlogloss:0.615234
## [20] val-mlogloss:0.614006
## [21] val-mlogloss:0.609879
## [22] val-mlogloss:0.608206
## [23] val-mlogloss:0.606996
## [24] val-mlogloss:0.605774
## [25] val-mlogloss:0.604661
## [26] val-mlogloss:0.603672
## [27] val-mlogloss:0.602863
## [28] val-mlogloss:0.601464
## [29] val-mlogloss:0.600711
## [30] val-mlogloss:0.599235
## [31] val-mlogloss:0.598486
## [32] val-mlogloss:0.597323
## [33] val-mlogloss:0.596512
## [34] val-mlogloss:0.595787
## [35] val-mlogloss:0.595490
## [36] val-mlogloss:0.594993
## [37] val-mlogloss:0.594946
## [38] val-mlogloss:0.594590
## [39] val-mlogloss:0.594271
## [40] val-mlogloss:0.593994
## [41] val-mlogloss:0.593954
## [42] val-mlogloss:0.593739
## [43] val-mlogloss:0.593048
## [44] val-mlogloss:0.592291
## [45] val-mlogloss:0.591669
## [46] val-mlogloss:0.591511
## [47] val-mlogloss:0.591226
## [48] val-mlogloss:0.590998
## [49] val-mlogloss:0.590114
## [50] val-mlogloss:0.589879
## [51] val-mlogloss:0.589760
## [52] val-mlogloss:0.589245
## [53] val-mlogloss:0.589075
## [54] val-mlogloss:0.588898
## [55] val-mlogloss:0.588851
## [56] val-mlogloss:0.588768
## [57] val-mlogloss:0.588583
## [58] val-mlogloss:0.588290
## [59] val-mlogloss:0.588096
## [60] val-mlogloss:0.588068
## [61] val-mlogloss:0.587981
## [62] val-mlogloss:0.587833
## [63] val-mlogloss:0.587991
## [64] val-mlogloss:0.587849
## [65] val-mlogloss:0.588040
## Stopping. Best iteration: 63
importance <- xgb.importance(feature_names = colnames(d_train), model = xgb_model)
xgb.plot.importance(importance)
#plot(varImp(xgb_model))
How well does the model predict?
val_prob <- matrix(predict(xgb_model, xgb_d_val), ncol = 3, byrow=T) %>% data.frame()
colnames(val_prob) <- c("low", "medium", "high")
val_prob <- data.frame(high = val_prob$high, low=val_prob$low, medium=val_prob$medium)
val_label <- probs_to_label(val_prob)$label
confusionMatrix(d_val$interest_level, val_label)
## Confusion Matrix and Statistics
##
## Reference
## Prediction high low medium
## high 224 267 468
## low 43 7926 602
## medium 149 1663 995
##
## Overall Statistics
##
## Accuracy : 0.7413
## 95% CI : (0.7334, 0.749)
## No Information Rate : 0.7989
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.36
## Mcnemar's Test P-Value : <2e-16
##
## Statistics by Class:
##
## Class: high Class: low Class: medium
## Sensitivity 0.53846 0.8042 0.48184
## Specificity 0.93834 0.7400 0.82360
## Pos Pred Value 0.23358 0.9247 0.35447
## Neg Pred Value 0.98313 0.4875 0.88772
## Prevalence 0.03372 0.7989 0.16738
## Detection Rate 0.01816 0.6425 0.08065
## Detection Prevalence 0.07773 0.6947 0.22753
## Balanced Accuracy 0.73840 0.7721 0.65272
sprintf("log loss = %.4f", logloss(d_val$interest_level, val_prob))
## [1] "log loss = 0.5880"
The model predicts high and low better than medium, even though high has a much lower prevalence (only 4%). But this makes sense, as our earlier analysis found few features that clearly separate medium from the other two classes. A one-vs-all approach might perform even better.
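The one-vs-all idea can be sketched as one binary model per class, with the scores renormalised into probabilities. A plain logistic regression on a single feature stands in here for the real feature matrix and booster; names are illustrative.

```r
# Sketch: one-vs-all classification with per-class binary models.
one_vs_all_probs <- function(x, y, classes = unique(y)) {
  scores <- sapply(classes, function(cls) {
    m <- glm((y == cls) ~ x, family = binomial())
    predict(m, type = "response")   # score for class `cls` per row
  })
  scores / rowSums(scores)          # renormalise across classes
}
```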